1. Relating empirical and real structures: additive and multiplicative
results. The key issue investigated in Vladimir Koltchinskii's paper is the
behavior of an empirical minimizer $\hat f \in F$, that is, a function $f$
in $F$ with minimal sample average,

$$P_n f = \frac{1}{n} \sum_{i=1}^{n} f(X_i),$$

where $X_1, \ldots, X_n$ are drawn i.i.d. from a probability measure $P$ on
$\mathcal{X}$ and $F$ is a class of real-valued functions defined on
$\mathcal{X}$. The study of bounds on the expectation $P\hat f$ arises in
many applied areas, including the analysis
of randomized optimization methods involving Monte Carlo estimates of 
integrals. Motivated by prediction problems that arise in machine learning 
and nonparametric statistics, the paper makes an important contribution 
to the study of these bounds, and to the development of model selection 
methods that exploit the bounds. 
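
As a concrete illustration of the definition above, the following minimal
Python sketch computes sample averages $P_n f$ and an empirical minimizer
over a small finite class. The class (quadratics on a grid of centers), the
uniform sampling distribution, and the helper names `sample_average` and
`empirical_minimizer` are assumptions made for illustration; none of them
comes from the paper.

```python
import random

def sample_average(f, xs):
    """P_n f: the average of f over the sample xs."""
    return sum(f(x) for x in xs) / len(xs)

def empirical_minimizer(function_class, xs):
    """Return a function in the (finite) class with minimal sample average."""
    return min(function_class, key=lambda f: sample_average(f, xs))

# Hypothetical finite class: f_t(x) = (x - t)^2 for a few centers t.
centers = [0.0, 0.25, 0.5, 0.75, 1.0]
function_class = [lambda x, t=t: (x - t) ** 2 for t in centers]

# Hypothetical data: 200 i.i.d. draws from the uniform distribution on [0, 1].
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]

f_hat = empirical_minimizer(function_class, xs)
print(sample_average(f_hat, xs))
```

For infinite classes the minimization is of course not a finite search; the
sketch only fixes the meaning of $P_n f$ and $\hat f$.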

The broad approach taken in this paper, and in much earlier work, is to
show that the empirical structure (i.e., the collection of sample averages,
$P_n f$) is close to the real structure (i.e., the collection of
expectations, $Pf$). If they are close in the additive sense that
$\|P_n - P\|_F$ decreases at some rate, then it is clear that $P\hat f$
approaches $\inf_{f \in F} Pf$ at that rate. As the paper recalls, there is
a tight relationship between the Rademacher process indexed by coordinate
projections of the class $F$ and this additive notion of closeness of
empirical and real structures. Also, it can be advantageous to consider
these properties only locally, that is, in the set $F(\delta) \subset F$ of
near-minimizers of $Pf$. In particular, if the variance of elements of
$F(\delta)$ goes to zero with $\delta$, then faster rates are possible
through the study of these local properties.
An alternative, developed in the paper, is closeness in the multiplicative
sense that for $0 < \varepsilon < 1$, for all functions $f$ in $F$ that have
expectations not too small,

$$(1 - \varepsilon) P_n f \le Pf \le (1 + \varepsilon) P_n f.$$

Again, these results rely on the variance of an element of $F$ decreasing
as its expectation decreases. Let us call a class that has this property a
Bernstein class.

Definition 1.1. We say that $F$ is a $(\beta, B)$-Bernstein class with
respect to the probability measure $P$ (where $0 < \beta \le 1$ and
$B \ge 1$), if every $f$ in $F$ satisfies

$$Pf^2 \le B (Pf)^{\beta}.$$

This condition arises naturally in many situations, as the paper describes.
Obviously, if $F$ consists of nonnegative functions bounded by $b$, then $F$
is a Bernstein class (with $\beta = 1$) with respect to any probability
measure. Other examples arise for excess loss classes,

$$F = \{\ell_g - \ell_{g^*} : g \in G\} \quad\text{with}\quad \ell_g(x, y) = \ell(g(x), y),$$

where $\ell : \mathbb{R}^2 \to [0, \infty)$ is a loss function and
$g^* \in G$ minimizes $P\ell_g$. For example, in regression, if
$a \mapsto \ell(a, y)$ are uniformly convex Lipschitz bounded functions and
$G$ is convex, then $F$ is a Bernstein class [2, 5, 6]. In pattern
classification with $\ell$ the discrete loss, if $g^*$ is the Bayes rule and
the conditional probability $\Pr(Y = 1 \mid X)$ is unlikely to be near
$1/2$, then the excess loss class is Bernstein [10].
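
The first example in the paragraph above can be verified in one line. A
short derivation, assuming $0 \le f(x) \le b$ for every $x$ (the constant
$\max\{b, 1\}$ below is introduced here only so that the Bernstein constant
is at least 1):

```latex
0 \le f \le b
\;\Longrightarrow\; f^2 \le b\,f \ \text{ pointwise}
\;\Longrightarrow\; Pf^2 \le b\,Pf \le \max\{b, 1\}\,(Pf)^{1}.
```

No property of $P$ is used beyond monotonicity of the expectation, which is
why the conclusion holds for any probability measure.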

Under some mild assumptions on a Bernstein class F, there is a simple 
proof of the multiplicative closeness of the empirical and true structures, 
using Talagrand-style concentration inequalities for empirical processes [9]. 
The assumptions are that functions in $F$ are bounded, and that $F$ is
star-shaped around 0, that is, for every $0 \le \alpha \le 1$ and any
$f \in F$, $\alpha f \in F$. A
generalization of the following result (for arbitrary Bernstein conditions) 
appears in [3]. 

Theorem 1.2. There exists an absolute constant $c$ for which the following
holds. For $F$ a $(1, B)$-Bernstein class of functions bounded by $b$ which
is star-shaped around 0, with probability at least $1 - e^{-x}$, the
empirical minimizer $\hat f \in F$ satisfies

$$P\hat f \le \max\Bigl\{ \inf\{r > 0 : \xi_n(r) \le r/4\},\ \frac{c(b + B)x}{n} \Bigr\},$$

where

$$\xi_n(r) = \mathbb{E} \sup\{Pf - P_n f : f \in F_r\} \quad\text{with}\quad F_r = \{f \in F : Pf = r\}.$$
The proof uses a simple geometric argument: Talagrand's inequality implies
that, for the subset $F_r$ with $r$ not too small, there is a
near-equivalence between the multiplicative comparison inequality

$$(1 - \varepsilon) P_n f \le Pf \le (1 + \varepsilon) P_n f$$

holding uniformly over $F_r$, and the expectation of the supremum of the
empirical process, $\mathbb{E}\|P - P_n\|_{F_r}$, being less than
$r\varepsilon$. The star-shaped property then shows that this extends to all
functions in $F$ that have $Pf \ge r$. The critical level $r$ cannot be too
small because, by the star-shaped property, the "relative complexity" of the
sets $F_r$ increases as $r$ decreases.

Notice that this result is in terms of the fixed point

$$\inf\{r > 0 : \xi_n(r) \le r/4\},$$

which is never larger than fixed points of the related functions in
Koltchinskii's paper. In particular, $\xi_n(r)$ is bounded by
$\mathbb{E}\|P - P_n\|_{F_r}$, the expected supremum of the empirical
process indexed by functions that have expectation $r$, whereas the paper
considers expected suprema over larger sets, defined by the $L_2(P)$
structure.
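
To make the fixed point concrete, here is a minimal Python sketch that
locates $\inf\{r > 0 : \xi_n(r) \le r/4\}$ by bisection. It assumes that
$\xi_n(r)/r$ is nonincreasing in $r$ (which is what the star-shaped property
gives), so that the set of admissible $r$ is an interval unbounded above.
The particular function $\xi_n(r) = \sqrt{rd/n}$ and the values of `d` and
`n` are hypothetical stand-ins, not quantities taken from the paper.

```python
import math

def fixed_point(xi, lo, hi, tol=1e-10):
    """Approximate inf{r in [lo, hi] : xi(r) <= r/4} by bisection.

    Assumes xi(r)/r is nonincreasing in r and that xi(hi) <= hi/4.
    """
    if xi(hi) > hi / 4:
        raise ValueError("no fixed point below hi")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if xi(mid) <= mid / 4:
            hi = mid   # mid is admissible: the infimum is at or below mid
        else:
            lo = mid   # mid is not admissible: the infimum is above mid
    return hi

# Hypothetical complexity function xi(r) = sqrt(r * d / n).
d, n = 5, 1000
xi = lambda r: math.sqrt(r * d / n)

# Solving sqrt(r d / n) = r / 4 by hand gives r = 16 d / n.
r_star = fixed_point(xi, 1e-12, 1.0)
print(r_star, 16 * d / n)
```

With this choice of $\xi_n$ the fixed point scales like $d/n$, which is the
kind of fast rate that local analysis is designed to capture.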

2. Data-dependent bounds and model selection. One of the appealing
features of these kinds of bounds, which is developed in Koltchinskii's
paper, is that there are empirical versions that show that we can accurately
estimate the bounds using the sample. It turns out that this is also the
case for the result of Theorem 1.2; see [4]. The idea is to replace the
quantity $\xi_n(r) = \mathbb{E}\|P - P_n\|_{F_r}$ with a sample-based
estimate of the corresponding Rademacher averages,

$$\hat\xi_n(r) = R_n(\hat F_r) \quad\text{with}\quad \hat F_r = \{f \in F : c_1 r \le P_n f \le c_2 r\},$$

for some constants $c_1 < 1 < c_2$. The same concentration properties that
imply the bounds in terms of the fixed point of $\xi_n$ show that a fixed
point of $\hat\xi_n(r) + c_3 r$ also suffices.
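
The sample-based quantity above can be approximated by simulation. The
following minimal Python sketch assumes the usual definition of the
empirical Rademacher average, the expectation over independent random signs
$\sigma_i$ of $\sup_f \frac{1}{n}\sum_i \sigma_i f(X_i)$, and works with a
finite class given by its value vectors on the sample. The function names,
the constants `c1` and `c2`, and the toy data are assumptions for
illustration only.

```python
import random

def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_f (1/n) sum_i sigma_i f(X_i).

    values: one vector (f(X_1), ..., f(X_n)) per function in the class.
    """
    if not values:
        return 0.0
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, vec)) / n for vec in values)
    return total / n_draws

def localize(values, r, c1=0.5, c2=2.0):
    """Keep the functions whose sample average lies in [c1 * r, c2 * r]."""
    return [vec for vec in values if c1 * r <= sum(vec) / len(vec) <= c2 * r]

# Hypothetical value vectors of four functions on a sample of size 8.
values = [[0.0] * 8, [0.1] * 8, [0.1, 0.3] * 4, [1.0] * 8]
local = localize(values, r=0.15)
print(len(local), empirical_rademacher(local))
```

Only function values on the sample enter the computation, which is the sense
in which such a bound is data-dependent.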

Another interesting contribution in the paper is the application of these 
bounds in terms of empirical quantities to model selection problems. It is 
natural to consider how estimates of expectations (i.e., estimates of risk, 
in the case of loss classes) can be used to define penalization methods for 
model selection. In particular, define the risk
$P\ell_f = P\ell(Y, f(X))$, where $\ell$ is a nonnegative loss function and
$(X, Y)$ is a covariate/response pair. Suppose that we have a sequence
$F_1, F_2, \ldots$ of function classes defined on $\mathcal{X}$, and we use
an estimator that first chooses the empirical minimizer

$$\hat f_k = \arg\min_{f \in F_k} P_n \ell_f$$
from each $F_k$, and then picks $\hat f = \hat f_{\hat k}$ as the
$\hat f_k$ that minimizes a penalized risk of the form

$$P_n \ell_{\hat f_k} + p(k).$$

A key concern in these problems is proving oracle inequalities of the form

$$\Pr\Bigl( P\ell_{\hat f} > \inf_k \Bigl\{ \inf_{f \in F_k} P\ell_f + \tilde p(k) \Bigr\} \Bigr) \to 0,$$

where $\tilde p(k)$ is a complexity penalty related to the penalty $p(k)$
used by the method. Notice, in particular, that the constant multiplying the
risk ($P\ell_f$) term is 1. It turns out that if the classes are ordered by
inclusion, then multiplicative bounds for the excess loss class immediately
give such oracle inequalities.
The multiplicative bounds we need are of the form

$$\forall f \in F, \quad (1 - \varepsilon) P_n f - r \le Pf \le (1 + \varepsilon) P_n f + r.$$

(Notice that, although upper bounds of this kind are immediate from the
proof of Theorem 1.2, the lower bounds are not.) The following theorem is
elementary; it is proved in [1]. Define $f_k^*$ as the element of $F_k$ that
minimizes $P\ell_f$.

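Theorem 2.1. Suppose that

$$\sup_k \sup_{f \in F_k} \bigl( P\ell_f - P\ell_{f_k^*} - 2(P_n \ell_f - P_n \ell_{f_k^*}) - \varepsilon_k \bigr) \le 0,$$

$$\sup_k \sup_{f \in F_k} \bigl( P_n \ell_f - P_n \ell_{f_k^*} - 2(P\ell_f - P\ell_{f_k^*}) - \varepsilon_k \bigr) \le 0,$$

where the classes are ordered by inclusion, and the quantities
$\varepsilon_k$ are similarly ordered,
$F_1 \subseteq F_2 \subseteq F_3 \subseteq \cdots$,
$\varepsilon_1 \le \varepsilon_2 \le \varepsilon_3 \le \cdots$. Then
choosing $p(k) = 7\varepsilon_k/2$ ensures that

$$P\ell_{\hat f} \le \inf_k \bigl( P\ell_{f_k^*} + 9\varepsilon_k \bigr).$$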
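
The selection rule in Theorem 2.1 is simple to state in code. The following
minimal Python sketch takes, for each class $F_k$, the empirical risks
$P_n \ell_f$ of its (finitely many) members, finds the empirical minimizer
in each class, and picks the class minimizing the penalized risk with
$p(k) = 7\varepsilon_k/2$. The function name `select_model` and all numbers
are hypothetical; the sketch illustrates the rule only and does not check
the theorem's hypotheses.

```python
def select_model(emp_risks_per_class, penalties):
    """Penalized empirical risk minimization over a sequence of classes.

    emp_risks_per_class[k] lists the empirical risks P_n l_f for f in F_k.
    Returns (k_hat, index of f_hat within F_{k_hat}, penalized risk).
    """
    best = None
    for k, (risks, pen) in enumerate(zip(emp_risks_per_class, penalties)):
        j = min(range(len(risks)), key=risks.__getitem__)  # empirical minimizer in F_k
        score = risks[j] + pen                             # P_n l_{f_k} + p(k)
        if best is None or score < best[2]:
            best = (k, j, score)
    return best

# Hypothetical nested classes F_1 in F_2 in F_3 (each reuses the earlier risks).
emp_risks = [[0.30], [0.30, 0.22], [0.30, 0.22, 0.21]]
eps = [0.01, 0.02, 0.05]             # assumed nondecreasing epsilon_k
penalties = [7 * e / 2 for e in eps] # p(k) = 7 * epsilon_k / 2

k_hat, j_hat, score = select_model(emp_risks, penalties)
print(k_hat, j_hat, score)
```

In this toy example the largest class has the smallest empirical risk, but
its penalty outweighs the gain, so the middle class is selected.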

3. Lower bounds. It is interesting to consider the tightness of the upper 
bounds of the type proved in the paper. Koltchinskii provides examples that 
demonstrate optimal rates in several minimax settings. But is it true that, 
for all function classes and probability distributions, the upper bounds imply 
the correct rate of convergence of $P\hat f$ to its asymptotic value?

It turns out that they are not tight. Indeed, in attempting to prove match- 
ing lower bounds, we were led to the following theorem (see [3]), which uses 
a direct analysis of the empirical minimizer to give essentially matching up- 
per and lower bounds on its expectation, in terms of a related property of 
the empirical process. Set 

$$\xi_n(r) = \mathbb{E} \sup_{f \in F_r} (Pf - P_n f), \quad\text{where}\quad F_r = \{f \in F : Pf = r\},$$

and, for $\varepsilon > 0$, define

$$r_{\varepsilon,+} = \sup\Bigl\{0 \le r \le b : \xi_n(r) - r \ge \sup_s (\xi_n(s) - s) - \varepsilon\Bigr\},$$

$$r_{\varepsilon,-} = \inf\Bigl\{0 \le r \le b : \xi_n(r) - r \ge \sup_s (\xi_n(s) - s) - \varepsilon\Bigr\}.$$

These two quantities bracket the range of values of $r$ that
$\varepsilon$-approximately maximize the function $\xi_n(r) - r$. The
theorem shows that, for $\varepsilon$ not too small, they also bracket the
expectation of the empirical minimizer.
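
A minimal Python sketch of these two quantities, computed on a grid: it
evaluates $\xi_n(r) - r$ at equally spaced points of $[0, b]$, finds the
points within $\varepsilon$ of the maximum, and returns the smallest and
largest of them. The grid discretization, the function name `bracket`, and
the choice $\xi_n(r) = 0.2\sqrt{r}$ are assumptions for illustration, not
taken from the paper.

```python
import math

def bracket(xi, b, eps, grid_size=10001):
    """Grid approximation of (r_{eps,-}, r_{eps,+}) on [0, b]."""
    rs = [b * i / (grid_size - 1) for i in range(grid_size)]
    gaps = [xi(r) - r for r in rs]
    threshold = max(gaps) - eps          # sup_s (xi(s) - s) - eps, on the grid
    near = [r for r, g in zip(rs, gaps) if g >= threshold]
    return min(near), max(near)

# Hypothetical xi(r) = 0.2 * sqrt(r): xi(r) - r peaks at r = 0.01 with value 0.01.
xi = lambda r: 0.2 * math.sqrt(r)
r_minus, r_plus = bracket(xi, b=1.0, eps=0.0025)
print(r_minus, r_plus)
```

For this $\xi_n$ the exact bracket is $[0.0025, 0.0225]$, obtained by solving
$0.2\sqrt{r} - r = 0.0075$; the grid answer agrees up to the grid spacing.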

Theorem 3.1. For any $c_1 > 0$, there is a constant $c$ such that the
following holds. Let $F$ be a $(1, B)$-Bernstein class that is star-shaped
at 0. Define $\xi_n$, $r_{\varepsilon,+}$ and $r_{\varepsilon,-}$ as above,
and set

$$r' = \max\Bigl\{ \inf\{r > 0 : \xi_n(r) \le r/4\},\ \frac{c(b + B)(x + \log n)}{n} \Bigr\}.$$

Let $\hat f$ denote an empirical risk minimizer. If

$$\varepsilon \ge c \Bigl( \max\Bigl\{ \sup_{s > 0} (\xi_n(s) - s),\ r' \Bigr\} \frac{(B + b)(x + \log n)}{n} \Bigr)^{1/2},$$

then

1. With probability at least $1 - e^{-x}$,

$$\mathbb{E}\hat f \le \max\Bigl\{ \frac{1}{n},\ r_{\varepsilon,+} \Bigr\}.$$

2. If

$$\xi_n(0, c_1/n) < \sup_{s > 0} (\xi_n(s) - s) - \varepsilon,$$

then with probability at least $1 - e^{-x}$,

$$\mathbb{E}\hat f \ge r_{\varepsilon,-}.$$


The following theorem (see [3]) shows that there is a real gap between this
result and the bounds in terms of fixed points of $\xi_n$ described in
Theorem 1.2, and thus between this result and the similar bounds in
Koltchinskii's paper.

Theorem 3.2. There is an absolute constant $c$ for which the following
holds. If $0 < \delta < 1$ and $n > N_0(\delta)$, there is a probability
measure $P$ and a star-shaped class $F$, which consists of functions bounded
by 1 and is a $(1, 2)$-Bernstein class, such that

1. For every $X_1, \ldots, X_n$ there is a function $f \in F$ with
$\mathbb{E}f = 1/4$ and $\mathbb{E}_n f = 0$.

2. For the class $F$, $\inf\{r > 0 : \xi_n(r) \le r/4\} = 1/4$.

3. If $\hat f$ is a $\rho$-approximate empirical minimizer, where
$0 < \rho < 1/8$, then with probability larger than $1 - \delta$,

$$\frac{1}{n}\Bigl(1 - c\sqrt{\frac{\log n}{n}}\Bigr) \le \mathbb{E}\hat f \le \frac{1}{n}\Bigl(1 + c\sqrt{\frac{\log n}{n}}\Bigr).$$

So there is an example in which Theorem 3.1 demonstrates that $P\hat f$ is
of order $1/n$, but the local Rademacher bounds are constants. Although
the example is of a class $F$, it is straightforward to show that, under
mild conditions on a loss function $\ell$, this class can be written as an
excess loss class $\{\ell_g - \ell_{g^*} : g \in G\}$ for some $G$ and some
probability distribution (see [3]).

We have seen that we can obtain a data-dependent version of the local 
Rademacher bounds that can be used as complexity penalties in model se- 
lection methods. If the same thing were true for the bounds of Theorem 3.1, 
we could improve on these model selection methods. Unfortunately, this 
is not possible if one only has access to function values on finite samples. 
There is an example in [4] that shows that it is impossible to establish a
data-dependent upper bound on the expectation of the empirical minimizer
that is asymptotically better than the fixed point of $\xi_n(r)$. The idea
is to construct two classes of functions that look identical when projected
on any sample of finite size, but for one class both a typical expectation
of the empirical minimizer and the fixed point of $\xi_n(r)$ are of the
order of a constant, while for the other a typical expectation is of the
order of $1/n$.

4. The role of concentration. Arguably, the most important contribu- 
tion to modern prediction bound techniques is Talagrand's concentration 
inequality for empirical processes [9]. However, it is important to note that 
its full strength is rarely used. 

Roughly speaking, this inequality ensures that with high probability, the
dominant terms in the upper and lower estimates on $\|P_n - P\|_F$ are
$(1 + \alpha)\mathbb{E}\|P_n - P\|_F$ and
$(1 - \alpha)\mathbb{E}\|P_n - P\|_F$, where $\alpha$ can be made
arbitrarily close to 0, at a price of larger second-order terms. In fact, in
the vast majority of results one can take $\alpha$ to be any fixed constant
$0 < \alpha < 1$.

The important point is that in multiplicative-type results (e.g.,
ratio-limit theorems as presented in the paper or similar to Theorem 1.2),
the role of this coefficient is not important. It is only when one wishes to
analyze the behavior of the empirical minimizer on the set $F_r$ and compare
it to its behavior on $F_s$ for $r \ne s$ that the exact dependency on
$\alpha$ is required. This is the case in the proof of Theorem 3.1.
Moreover, in the vast majority of results that do not involve multiclass
analysis, the actual role of Talagrand's concentration inequality is
restricted to ensuring a better dependency on the confidence level $\delta$:
from polynomial in $1/\delta$ to logarithmic in $1/\delta$. Indeed, an
almost identical result to Theorem 1.2 can be proved without Talagrand's
inequality, leading to the same order of error rates but with a worse
constant. The dominant term remains the same: the fixed point of the
function $\mathbb{E}\|P_n - P\|_{F_r}$.

One should ask: why not always use Talagrand's inequality? The reason 
is that it is not always available. Concentration of the supremum of an 
empirical process is known for a class with a bounded diameter in $L_\infty$.
Thus, any result which is truly based on this concentration does not extend 
to unbounded classes. Of course, it could be very interesting to develop a 
similar theory for the unbounded case. 

5. Some questions. 

1. Talagrand's concentration inequality is a "function class" version of
Bernstein's inequality, with the secondary terms determined by the $L_2$ and
$L_\infty$ diameters of $F$. It could be useful (and not only from the
statistical point of view) to prove a concentration result with the
$L_\infty$ diameter replaced by the $\psi_1$ diameter (recall that for
$\alpha \ge 1$,
$\|X\|_{\psi_\alpha} = \inf\{c > 0 : \mathbb{E}\exp(|X|^\alpha / c^\alpha) \le 2\}$;
the $\psi_1$ norm measures the subexponential decay of $X$).

2. The results in the paper are based on the behavior of the Rademacher
process indexed by a random coordinate projection of $F$ (i.e., the
restriction of $F$ onto a random sample). Thus, error bounds are determined
using a random (empirical) metric on coordinate projections. It would be
interesting to develop a theory of learning which uses "global" metric
structures. Clearly, the $L_2(P)$ one, which is the natural candidate, is
too weak, for otherwise the supremum of the empirical process indexed by $F$
could be controlled in terms of the limiting Gaussian, which is not true.
It is more likely that stronger metrics (e.g., the $\psi_\alpha$ metrics)
will play a central role in such a development, as in [7, 8].
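
For readers who want to experiment with the $\psi_\alpha$ norms mentioned in
the first question, the following minimal Python sketch computes an
empirical analogue, with the expectation replaced by an average over a
finite sample. It assumes the standard Orlicz-norm definition
$\inf\{c > 0 : \mathbb{E}\exp(|X|^\alpha/c^\alpha) \le 2\}$ and uses
bisection, which is valid because the average is decreasing in $c$. The
function name `psi_norm` and the test inputs are hypothetical.

```python
import math

def psi_norm(xs, alpha=1.0, tol=1e-10):
    """Empirical psi_alpha norm: inf{c > 0 : mean(exp((|x|/c)^alpha)) <= 2}."""
    def ok(c):
        try:
            avg = sum(math.exp((abs(x) / c) ** alpha) for x in xs) / len(xs)
        except OverflowError:
            return False  # c is far too small: the average exceeds 2
        return avg <= 2.0

    m = max(abs(x) for x in xs)
    if m == 0.0:
        return 0.0
    lo, hi = 0.0, m
    while not ok(hi):                 # grow hi until the constraint holds
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol * hi:         # bisect on the monotone constraint
        mid = 0.5 * (lo + hi)
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi

# For the constant X = 1 and alpha = 1, exp(1/c) = 2 gives c = 1 / log 2.
print(psi_norm([1.0]), 1 / math.log(2))
```

A finite sample always has a finite empirical norm, so this says nothing by
itself about whether the underlying class is bounded in $\psi_\alpha$.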